Stream data compression system using dynamic connection groups

ABSTRACT

A generalized system for compression of data such as text or images. Compression is achieved by identifying new connections between consecutive data units, existing connections or both. A representation of each new connection is then used to replace the consecutive items throughout the data, which are typically considered in pairs. A hierarchical connection structure resembling a tree is built up as data is received, so that the effectiveness of compression increases with time. After each session of data reception, the initially compressed data may be reorganized with new connections being identified and stored in an index for future use, while also further compressing the existing data.

BACKGROUND OF THE INVENTION

This invention relates to compression of data in computer information systems, and more particularly to a memory structure and also to methods which provide a means of storing information in and retrieving information from that structure.

One of the most common uses of computers is that of storing information. In business, this is typically done by a user entering data into a display screen via a keyboard, which is then stored in computer memory, either volatile or permanent, such as RAM or rotating disk memory. Also, today, data such as images of documents and photographs is commonly entered into computers via a scanner. Many other types of data, such as sounds, video images and the output of scientific instruments may be stored in computers.

Before storing information, a computer, via the appropriate peripheral device, and in some cases with the help of the operating system, first breaks down the information into discrete units of data, which will be called "data units". For example, sound may be broken down into separate audio samples, images into discrete pixels, and keyboard characters are transmitted from the keyboard in the form of a series of discrete codes. The discrete units are then recorded as digital data in computer memory.

In order to retrieve information from a computer, the stored discrete data units are played back, via a suitable device, which re-creates the original type of data (an image, sounds, characters etc.) with the appropriate values in the right order, and in so doing, creates a new instance of the original information.

Perhaps the clearest example is that of character data. In the case of this data type, data from the keyboard is typically stored in computer memory as a digital form of the characters which are typed. Most computer operating systems and database management systems record typed information in a 1:1 ratio. For example, the typed word "hello" is typically stored as five letters. That is, five data units exist outside computer storage and five data units are stored inside.

Some languages such as Chinese require multi-byte character sets. Even though more than one byte of storage may be required to store a single character, these characters are still stored in a 1:1 ratio because each character outside computer memory is separately recorded as a single character inside computer memory.

Recording information in a 1:1 ratio can result in inefficient use of storage space. This is particularly so in respect of large amounts of information which contain contiguously repeated data of the same values; that is, information such as images, sounds and large text documents. For example: text typically includes repeated blanks; images of documents, significant areas of white; and pictures, various areas of the same colour and intensity.

Some computer operating systems use data compression methods to reduce the amount of space required to store information. For example, rather than store 30 typed spaces, such an operating system might store one space and the number 30 along with a flag to indicate that this information is compressed. A variety of compression algorithms are nowadays commonly employed in operating systems to reduce the space required to store contiguous repetitions of the same data values. Such algorithms cannot compress non-contiguous repetitions of data values. Other algorithms may compress non-contiguous data, but rely on pre-existing dictionaries of code words which are generally fixed before the algorithms are used.
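By way of illustration only, the contiguous-repetition scheme described above (one space plus the number 30) can be sketched as a simple run-length encoder. The function names `rle_encode` and `rle_decode` are hypothetical, and the pair-list format is an assumption made for this sketch rather than the format of any particular operating system.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Rebuild the original text from (character, count) pairs."""
    return "".join(ch * count for ch, count in pairs)


# Thirty contiguous spaces collapse to a single pair.
spaces = rle_encode(" " * 30)

# The non-contiguous repetition of "ab" is not compressed at all:
# every character becomes its own pair.
scattered = rle_encode("abab")
```

The second call shows the limitation noted in the text: repetitions that are not contiguous gain nothing from this kind of scheme.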

Compression algorithms, traditionally part of major operating systems such as VMS and VS, may be a separate product in respect of smaller operating systems, such as earlier releases of MS-DOS, which do not have compression. Whether an integral part of an operating system or added later, compression algorithms perform an operating system role, that is, they are part of the low-level interface between applications and hardware.

Compression algorithms often function in a client-server relationship with other parts of an operating system. Typically, uncompressed data is supplied to the algorithm by another part of the operating system, called the "client process". The algorithm compresses the data and hands it back, for the client process to use as it wishes. For example, data, once compressed, may be stored by a client process or a derivative of it in a field in a file, or it might be sent to a modem, or it might be used in some other way. The way it is used is not directly of interest to the particular compression algorithm, although different uses may benefit from different algorithms. The same goes for decompression.

Due to the falling price of computers and particularly computer memory, both volatile and permanent, a recent trend in computer use has been towards the storage of large data objects such as video images, songs, spoken conversations and large bodies of text.

SUMMARY OF THE INVENTION

The present invention is a method in an information system which allows compression of non-contiguous groups of data units. It involves a general-purpose memory structure whose contents may be any type(s) of information; and it embodies a general-purpose method for organising that structure and its contents.

The invention may be used in a variety of different ways whose primary utility may not be limited to or may not relate to efficiencies in data compression. The purpose or use of the present invention is therefore expressly not limited to the purpose and use exemplified in the embodiments described herein. It may also be used in a wide variety of computer based systems where larger amounts of data are stored or electronically transmitted to another site.

The invention may broadly be said to consist in a method of compressing data, comprising the steps of assigning to each data unit a code, recording a repetition of code sequences and upon a pre-determined threshold of repetition of any particular sequence assigning to that code sequence a new code, and using a hierarchical structure built with data units and established code sequences to form higher level code sequences.

More particularly the invention may be said to consist in a method of compressing data by computer, in which connection groups of two or more consecutive data units are identified and recorded as compression proceeds, comprising:

receiving a stream of data units to be compressed,

storing a record of each received data unit in a processing block of computer memory,

determining whether each received data unit forms an already identified connection group existing in an index block of computer memory, when considered in relation to any immediately preceding data unit or connection group recorded in said processing block,

storing in said processing block a fresh record of any such already identified connection group formed by said received data unit, in place of said received data unit and said immediately preceding data unit or connection group recorded in said processing block,

determining whether each connection group freshly recorded in said processing block forms a larger already identified connection group existing in said index block, when considered in relation to any respective immediately preceding data unit or connection group recorded in said processing block,

storing a further fresh record of any such larger already identified connection group, in place of said connection group freshly recorded in said processing block and said respective immediately preceding data unit or connection group recorded in said processing block,

delivering the records stored in said processing block for further storage or for transmission as required, once said stream of data units has been received and compressed.

As a further feature the invention may be said to consist in a method in which at least one data stream has been received, initially compressed and stored, further comprising:

determining a number of occurrences of at least one pair of consecutive records in said initially compressed data,

determining whether the number of occurrences of said at least one pair of consecutive records exceeds a threshold number,

storing in said index block a record of a new connection group for any such pair of consecutive records in said initially compressed data having a number of occurrences greater than said threshold number,

storing in said initially compressed data a record of said new connection group in place of each occurrence of said consecutive pair of records which said new connection group represents.

BRIEF DESCRIPTION OF THE DRAWINGS

The above described advantages and operation of the present invention will be more fully understood upon reading the following description of the preferred embodiment in conjunction with the drawings, of which:

FIG. 1 and FIG. 2 form a flowchart illustrating the receiving process.

FIG. 3 is a flowchart illustrating the method within the receiving process of finding a connection.

FIG. 4 and FIG. 5 are flowcharts illustrating the reorganising process.

FIG. 7 and FIG. 8 are flowcharts illustrating the retrieving process.

FIG. 9 illustrates the contents of part of a values block.

FIG. 10 illustrates part of an index block, specifically a number of connections which constitute the chain associated with the data value "c".

FIG. 11 illustrates the contents of part of an index block, specifically being a number of connections which constitute part of the chain associated with the connection which yields the values "co".

FIG. 12 illustrates a structure of connections which yields the values "company". The parts printed in bold face relate to parts of FIG. 9, FIG. 10, FIG. 11, FIG. 13, and FIG. 15 printed in bold face.

FIG. 13 illustrates the contents over time of part of a processing array. The rows relate to the receipt of data units over time, and the second and subsequent columns to the contents of processing array locations.

FIG. 14 illustrates the units of received data, the corresponding units of stored data, and the corresponding units of retrieved data.

FIG. 15 illustrates a structure of connections part of which contains the same sub-structure as the connection illustrated in FIG. 12.

FIG. 16 illustrates a structure of connections in respect of a data unit type of words.

FIG. 17 is a perspective view of a computer system on which the invention might be used.

FIG. 18 is a generalised software system which may be implemented on the computer station of FIG. 17.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

A preferred embodiment of the present invention is disclosed herein; however, it is to be understood that the disclosed embodiment is merely exemplary of the invention, which may be embodied in various forms. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention in virtually any appropriately detailed structure.

The values in the diagrams of particular address/contents numbers are smaller than might in a given embodiment be the case. Smaller numbers have been used to aid clarity. In respect of the diagrams, character data units have been used to assist clarity and economy of space. This should not be interpreted as conferring any particular importance to this data type and data unit type. Written examples are given in respect of a data unit of words to assist easy understanding.

Before going further it is necessary to clarify some terminology. There is a problem with saying "the address of the integer is at the first position in the string". This statement may mean either that the first position in the string contains a number which identifies the memory location at which the integer starts, or alternatively, that the integer starts at the first position in the string. Accordingly, when the number is the subject of attention, the term "address" is used; and when the place itself is the issue, "location" is used.

Furthermore, "the item is stored at address such and such" is understood to mean that the item is stored in memory starting at the location whose address is such and such, and that the item, depending on what type of item it is, may occupy more than one addressable unit of storage space. In such cases, the term "the next available address" is understood to take into account the number of addressable units of storage space that items require.

In addition, when statements are made such as "the operating system executes a read operation and reads the address then transfers its attention to the connection starting at the location identified by that address", it is understood and is consistent with the present invention that operations described by such statements may be encapsulated in some different form or facility of a particular computer programming language, and in respect of that programming language, may be referred to using names which are different to those used herein. The descriptions contained herein are intended to be generic, and to not pre-suppose any particular language.

The system which manages the operation of the present invention is called herein the "operating system" and this is understood herein to refer only to the operation of the present invention, rather than to other aspects of a computer system.

The present invention is a method of storing information such that new information may be stored in a relatively smaller amount of space. This is achieved by maintaining an index which represents commonly repeated groups of data units that have been received in the past. When a new data stream is received which contains an instance of an indexed group, all that need be stored is a code, typically the address of the location in the index of that group, rather than the group itself. For example, if the data "Sunset Boulevard" is received, and a representation of it exists in the index at, say, address 123456, all that need be stored is a single number (the memory address 123456) rather than 16 characters.

The present invention is conceptually divided into three processes: the receiving process, the reorganising process, and the retrieving process.

The receiving process manages the receipt and processing of new data streams. Data streams are composed of data units: for example characters, audio samples, image pixels. The inputs to the receiving process are the new data stream, and entries in the index. The output is a sequence of addresses which represents the received data stream, and it is this sequence of addresses which is handed back to the client process or on to another process.

The reorganisation process adds new information to the index based on repetitions in address sequences stored since the last reorganisation session. The inputs to the reorganisation process are stored address sequences, and existing entries in the index. The output is additional entries in the index, and optionally, changes to address sequences.

The retrieval process manages the retrieval of information which was previously stored. The inputs to the retrieval process are stored address sequences, and entries in the index. The output from the retrieval process is the information which was originally stored, that is, a new instance of the original data stream, and it is this sequence of values that is handed back to the client process or on to another process.

Memory is conceptually divided according to its use into three blocks: the values block, the index block, and the processing block. Such memory may be volatile or permanent, or consist of a combination of both. For example, a given embodiment may instantiate processing memory as RAM, and the index and values blocks as rotating disk memory.

The values block holds a single instance of each different data unit value that has been received in the past. Different types of data may be received, including sound samples, image pixels, and characters; and within a single data type, there may, in different instances, be different types of data units. For example, the character data type might consist of a data unit type of characters or a data unit type of words. In respect of discrete characters, the values block would contain an instance of each character received to date. In respect of a character data unit type of words, the values block would contain an instance of each different word received to date.

For the purpose of clarity, data values are allocated their own (conceptual) memory block, but they are really part of the index. Associated with a data value is a location reserved for the address, called the "associated address", of a chain. A chain is a group of connections in the index. An example portion of a values block is shown in FIG. 9 and will be explained in detail below.

The index block consists of connections. Typically, a mature index block contains a number of connections. A connection is a group of values. The address of the memory location at which the first value starts is called the address of the connection. A connection conceptually connects two memory locations. A connection is said to have a direction, which is from the first location connected, to the second. Connections themselves may be grouped together into chains. A chain consists of all the one or more connections which are connected to the same first location. The address of the first connection in a chain is called the address of the chain.

Connections may form a hierarchical connection structure. A connection structure looks diagrammatically like an inverted tree, with one connection at the apex, and data values at the bottom. The nodes in the tree structure, that is the places where a branching occurs, are connections, and connections in the structure are said to be on different levels. The lowest level is at the bottom and the highest at the top. An example connection structure is shown in FIG. 12 and will be explained in detail below.

It should be noted, however, that the structures formed by a group of connections will not necessarily represent meaningful words. Connections may be formed between a common suffix of one word, such as "ing", an intervening space, and a common prefix of a following word, such as "an". Spaces may therefore be treated in the same manner as other characters during the receiving, reorganising and retrieving processes.

A connection contains four reserved places and other information used by the operating system. In respect of the present embodiment, the first place in the connection is reserved for the address of the first of the two locations between which the connection subsists. The second place is reserved for the address of the second location. The third place is reserved for the address of the associated chain, if any, that is, for the address of the chain in the next level up. The fourth place is reserved for the address of the next connection, if any, in the chain on the same level, and of which the current connection forms a part. An example of a number of connections forming a chain is shown in FIGS. 10 and 11 and will be explained in more detail below.
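The four reserved places of a connection, and the inverted-tree structure that connections form over data values, may be pictured with the following minimal sketch. It is illustrative only: the class names `DataValue` and `Connection`, the function `expand`, the use of a Python dictionary to stand in for addressable memory, and the small addresses chosen are all assumptions of this sketch, not details of the embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataValue:
    value: str
    chain: Optional[int] = None   # "associated address" of a chain, if any


@dataclass
class Connection:
    first: int                    # place 1: address of the first location
    second: int                   # place 2: address of the second location
    chain: Optional[int] = None   # place 3: address of the chain one level up
    next: Optional[int] = None    # place 4: next connection in the same chain


def expand(memory, address):
    """Unpack an address into the data values at the bottom of its structure."""
    item = memory[address]
    if isinstance(item, DataValue):
        return item.value
    return expand(memory, item.first) + expand(memory, item.second)


# Hypothetical memory contents: a dictionary keyed by address.
memory = {
    1: DataValue("c", chain=10),
    2: DataValue("o"),
    3: DataValue("m"),
    10: Connection(first=1, second=2, chain=11),  # yields "co"
    11: Connection(first=10, second=3),           # yields "com"
}

word = expand(memory, 11)  # "com"
```

Because each connection records both of the addresses it joins, a single stored address such as 11 is sufficient to recover every data value beneath it.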

The processing block provides a number of structures of reserved locations which may be used by the receiving, reorganising and retrieving processes for a number of purposes including temporary storage and manipulation of addresses. A structure called the "input string" may be used by the receiving process to temporarily store a verbatim record of the received data stream. A structure called the "address string" may be used by the receiving process to store the sequence of addresses which represent the received data stream. A structure called the "output string" may be used by the retrieving process to store the new instance of the originally received data stream. A structure called the "processing array" is used by the receiving process in finding in the index already indexed data groups. An example indicating use of the processing array is given in FIG. 13, again to be explained in detail below.

Referring now to the drawings and more particularly to FIG. 1 there is shown a flow diagram of the receiving process.

When a data unit is received by the structure operating system the data unit is processed 105. Because of the variable amount of time that may be required for execution of the loop 110, the received data units may optionally be buffered 108. If the incoming data stream is buffered, then the next data unit is taken from the buffer 110. The operating system then executes a search operation 118. The search is executed against the values in the values block using the value of the current data unit as the search key.

The search operation seeks to achieve a match. For example, when the data unit "street" is received by the operating system, the operating system searches for the word "street" among the stored data values.

When the match operation fails 118, that is, when the value of the data unit is not present in the values block, the operating system executes a write operation and writes the value to an available location in the values block 120. When the match succeeds 118 the operating system identifies the address of the location which holds the found data value.

When the current data value is the first of this receiving session 125, the pointer PP is set to the address of the first array location 128. The location to which pointer PP points is called "location PP". The address of the current data value is written to location PP 135.

The operating system then returns 115 and starts to process the next data unit 110.

Now referring to FIG. 2 which is a continuation of FIG. 1, the operating system executes a write operation and writes the address of the next data unit's corresponding data value to the next available location (PP+1) in the processing array 210.

There is now a pair of addresses in the processing array. The operating system now determines whether a connection in the index exists between the two addresses in this pair 215. (FIG. 3 illustrates the process of seeking a connection in the index.) If a connection exists then there is a connection where the first address in the connection is the same as the first address in the pair, and similarly, the second address in the same connection is the same as the second address in the pair.

When a connection is found between the addresses in the pair, the operating system writes the address of the connection to location PP 218 then returns and gets the next data unit 110. Pointer PP is not incremented. This has the effect of overwriting the previous address at location PP, and the address at location PP+1 is ignored (it will be overwritten with the address of the next data value). This is done because now that a connection is found, neither of the addresses in the pair need be stored. Only the address of the connection need be retained, because the addresses of the pair can always be found by unpacking the information contained in the connection.

When a connection is not found between the addresses in the pair 215, and location PP is the first location relating to this data stream 220, pointer PP is incremented by one 225, and the next data unit processed 110.

When a connection is not found and location PP is not the first location this receiving session, the pointer PP is decremented by one 228. The operating system then evaluates the pair starting at the new location PP, and searches for a connection in the index between the addresses in this pair 235.

When no connection is found, the operating system increments the primary pointer PP by one 240, therefore ensuring that the second address in the current pair is not now overwritten when the next data unit is processed, then gets the next data unit 110.

When a connection is found, the operating system writes the address of the connection to location PP 238. The effect of this is to overwrite the old address in location PP, and because pointer PP is decremented 228 before getting the next data unit 110, to discard the second address in the old pair. The reason for this is that now a connection has been found, there is no need to keep the pair, only the address of the connection which connects them. Discarding the second address in the old pair leaves a gap in the array, and this gap is closed up by executing a copy operation and copying the value at location PP+2 into location PP+1, 232, which is the location of the discarded address.
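The overall effect of the receiving process may be sketched as follows. This is a deliberate simplification and not the flowchart of FIG. 1 and FIG. 2: the processing array is modelled as a Python list used as a stack instead of being driven by pointer PP, the values block and the index are modelled as dictionaries, and the function name `receive` and the addresses used are hypothetical. The sketch follows the steps stated in the summary: each newly received address is tested against the immediately preceding record, and any connection so found is tested again against the record preceding it.

```python
def receive(stream, values, index):
    """Turn a stream of data units into a shorter list of addresses.

    values: dict mapping data unit -> address (extended for unseen units)
    index:  dict mapping (first address, second address) -> connection address
    Assumption: value addresses and connection addresses never collide.
    """
    processing = []  # stands in for the processing array
    for unit in stream:
        if unit not in values:
            values[unit] = len(values)  # write the new value to a free location
        address = values[unit]
        # While the preceding record and the current address are connected,
        # replace both by the address of the connection and test again.
        while processing and (processing[-1], address) in index:
            address = index[(processing.pop(), address)]
        processing.append(address)
    return processing


# Hypothetical contents: "c", "o", "m" are known values; connection 1000
# yields "co" and connection 1001 yields "com".
values = {"c": 0, "o": 1, "m": 2}
index = {(0, 1): 1000, (1000, 2): 1001}

compressed = receive("com", values, index)  # [1001]
```

Three received data units are thus represented by one address, and a stream containing no indexed pairs is returned as one address per data unit.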

Referring now to the drawings and more particularly to FIG. 3 there is shown a flow diagram of the method within the receiving process of finding a connection, referred to in conditional branches 215 and 235 of FIG. 2. For the sake of exposition, the two addresses between which a connection is sought are given the names A1 and A2, and it is understood that in respect of FIG. 2 conditionals 215 and 235 they refer to the addresses in locations PP and PP+1 respectively.

The operating system executes a read operation and reads the first address (A1) in the pair. The operating system then shifts its attention to the item at the location of that address 305. This item may be either a data value or a connection 308.

When it is a data value, the operating system looks for an associated address of a chain 310. If there is no associated address then the data value is not connected to anything 335 and the search process ends and returns failure. When an associated address does exist, then there is a chain in the next level up in a connection structure, and the operating system sets a pointer, called the "search pointer" (SP) to that address 318.

When the first address (A1) is the address of a connection, the operating system looks in the third place in the connection for the address of a chain 315. When there is no address in the third place then the connection is not connected to anything 335 and the search process ends and returns failure.

When there is a chain's address in the connection's third place 315, the operating system sets pointer SP to that address 318. This is the address of the associated chain. That is, if there is a chain associated with address A1, search pointer SP is now set to the address of that chain.

The address of the chain is also the address of the first connection in the chain. The connection which pointer SP points at is called "connection SP". The operating system reads the second address in connection SP and executes a match operation against address A2 320.

When connection SP's second address does not match A2, that is, when connection SP does not connect A1 to A2, the operating system moves along the chain, looking at the second address in each connection in the chain 325, 328, 320. If the address A2 is found 320, the process ends returning success and the address of connection SP. If the end of the chain is reached and A2 was not found in any of its connections' second places 325, then the process ends 335 returning failure.
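The search just described amounts to a walk along a linked chain, which may be sketched as follows. The record type `Item`, the function name `find_connection`, and the dictionary standing in for memory are assumptions of this sketch; a single record type is used for both data values and connections purely to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """A data value (value set) or a connection (first and second set)."""
    value: Optional[str] = None
    first: Optional[int] = None
    second: Optional[int] = None
    chain: Optional[int] = None  # associated chain, one level up
    next: Optional[int] = None   # next connection in the same chain


def find_connection(memory, a1, a2):
    """Return the address of the connection from a1 to a2, or None."""
    sp = memory[a1].chain            # 310/315/318: chain associated with A1
    while sp is not None:
        if memory[sp].second == a2:  # 320: does connection SP lead to A2?
            return sp                # success: address of connection SP
        sp = memory[sp].next         # 325/328: move along the chain
    return None                      # 335: failure


# Hypothetical memory: "c" has a chain of two connections, "co" and "ca".
memory = {
    1: Item(value="c", chain=10),
    2: Item(value="o"),
    3: Item(value="a"),
    10: Item(first=1, second=2, next=11),  # "co"
    11: Item(first=1, second=3),           # "ca"
}

found = find_connection(memory, 1, 3)    # 11
missing = find_connection(memory, 2, 1)  # None: "o" has no chain
```

Only the second place of each connection needs to be compared, since every connection in the chain already shares the same first location.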

Referring now to FIG. 4 there is shown a flow diagram of the reorganising process.

Connections are created by the operating system when it reorganises the index block. When an instance of the storage structure of the design of the present invention is first used, there are no connections in the index block, provided that none has been imported from another computer, from storage media, or from elsewhere.

During a reorganisation session, the reorganisation process processes, typically in the order each was written, each stored address sequence which has not previously been processed by the reorganisation function in a prior reorganisation session. In a particular usage, the sequence may be, for example, the contents, which are addresses, of various fields in various files.

Such addresses are addresses either of data values or connections. The sequences of stored addresses represent the streams of data which were received. The representation is usually not a 1:1 representation; and as the index gets older, the same information will, on subsequent occasions it is received, typically be stored as fewer addresses.

The operating system sets the reorganisation pointer RP to the first address of the first sequence to be processed this reorganisation session 405. It then starts execution of a loop operation 408 which reads each address in each sequence to be processed. Within a single reorganisation session, the operating system may read each such address, or its replacement, in each such sequence, a number of times.

Within the loop starting at 408, the operating system looks at each contiguous pair of addresses in a sequence. These pairs of contiguous addresses are referred to by the shortened term "pairs".

Typically, the second address of a pair would represent the information which was received by the operating system after receipt of the data which the first address represents. Another embodiment may formulate pairs in the reverse order, that is, where the first address of the pair represents data received directly after those represented by the second. In this latter case other functions of the operating system would need to take account of this reverse order within pairs. However, whether one way round or the other, the addresses in a pair must represent sections of data which were received by the operating system next to each other in time, that is, which were temporally contiguous.

For each contiguous pair of addresses, the operating system counts the number of times it occurs within all sequences to be processed this reorganisation session 410.

When no repetition is found, or when the frequency of repetition is less than or equal to a number called the "connection creation threshold" 415 the operating system increments the loop 418 and processes the next contiguous pair 408. The first address in the next pair is the second address in the current pair.

The operating system counts the frequency of occurrence of pairs which consist of the same addresses in the same order. For example, the address in a sequence which yields the word "Sunset" and the address in a sequence which yields the word "Boulevard" might be found to occur together in that order 15 times during a reorganisation session. A pair of the same addresses in the reverse order is a different pair, not a different instance of the same pair, and might not normally be found.

When the current pair is found to be repeated with a frequency greater than the connection creation threshold, a connection is created in the index block in respect of that pair 420. The connection creation process is illustrated in FIG. 5.

The number of times a pair must be repeated to trigger the creation of a connection is dependent on the particular embodiment of the present invention, its particular use, its maturity, and the data type in question, and would be expected to vary between embodiments. A connection between two given addresses in a given order may be created once only. An embodiment may create connections after an invariant number of repetitions, or on some other basis, for example, on the basis of the top 20% of frequencies within the current reorganisation session, or, in order to moderate the growth of the index, as a function of the index's age or size.

After a connection is created in the index (420 and FIG. 5), the address of the connection is written to the location of the first address in the pair 425, overwriting the original first address in the pair. The reason is that now a connection has been created between the two addresses in the pair, it is not necessary to keep both addresses; only the address of the connection need be retained. The second address in the pair is now redundant information, because this address is contained in the connection. The location which contains the second address is now ignored 428, 430. As far as the reorganisation process is concerned it does not exist. Various means may be used to achieve this end. For example, the rest of the sequence might be moved left one location to fill up the gap, or the location which holds the second address might be logically ignored, so that where the first address in the pair is in location RP, the location in the sequence now pointed to by RP+1 is the location that would previously have been pointed to by RP+2.

After the location of the second address is removed from the sequence, the operating system tests whether the session is ended 435 or the sequence is ended 438. When the sequence is ended the next sequence is found 440 and the processing of that sequence started 408.

Referring now to FIG. 5 there is shown a flow diagram of the method within the reorganising process of creating a connection.

The operating system identifies the next location available in the index block for creation of a new connection 505, and sets a pointer, called the connection pointer (CP), to that location 508. The connection at that location is called "connection CP". The operating system then writes, starting at that location, the values which constitute the connection.

The first of the two addresses in the repeated pair is written to one of the places in connection CP; and the second, to another. In the present embodiment, the first address in the pair is written to the first place in the connection 510, although in some other embodiment it may be written to some other place in the connection; and likewise in the present embodiment, the second address in the pair is written to the second place in the connection 518. These addresses will, for the moment, be called the "first address" and the "second address" within a connection in virtue of being held respectively in the first and second places.

The operating system then updates other existing items in the storage structure in the manner illustrated in FIG. 6 520. Then the connection creation process ends.

Referring now to FIG. 6, which is referred to in 520 of FIG. 5.

When the first address in the new connection CP is the address of a data value 605, the operating system determines if it has an associated address 608. When it doesn't, the operating system writes the address of connection CP to the place reserved for the data value's associated address 610 then ends the connection creation process.

When there is an associated address, then the connection pointer CP2 is set to the location of this address 615. The existence of an associated address means a chain exists in respect of this data value, on the next level up in a connection structure.

When the item at the location of connection CP's first address is itself a connection 605-N, the operating system sets the connection pointer CP2 to the address of this connection 615 then looks at the place in connection CP2 reserved for the address of the associated chain, if any 618. In the preferred embodiment this place is the third place in a connection, and an address there is called the "third address" in virtue of being in the third place.

When the third place in connection CP2 does not contain the address of a chain 618-N, the operating system executes a write operation and writes the address of connection CP to the third place in connection CP2 630 then writes the value zero to the fourth place, which identifies this connection as the last in the chain.

To reiterate, a chain is a group of one or more connections each of which has the same first address. The address of the first connection in the chain is said to be the address of the chain. The chain "associated with" a given connection is the chain whose connections have as their first address the address of the given connection.

In the present embodiment each connection in the same chain, except the last, holds the address of the next connection in the chain. Other embodiments may store this information in some other form and/or place.

In the present embodiment, the fourth place in the connection is reserved for the address, if any, of the next connection in the chain. An address in this place is called the "fourth address" in virtue of it being in the fourth place.

The storage system contains the information which identifies the last connection in each chain. In the present embodiment, this information takes the form of a zero in the fourth place of the last connection, which identifies it as the last connection. Typically the order of the connections in a chain reflects the order in which the connections were created.

When the third place in connection CP2 does contain the address of a chain, the operating system then seeks to find the end of that associated chain then add the new connection to the end. The operating system sets the third connection pointer CP3 to the connection at the start of the chain 620.

The operating system then reads the value at the fourth place in connection CP3 628. This place is the place reserved for the address of the next connection in the chain.

When the fourth place contains the value zero, connection CP3, the first connection in the chain, is also the last in the chain, that is, it is the only connection in the chain. The operating system then executes a write operation and writes the address of connection CP to the fourth place in connection CP3 635.

When the fourth place contains a valid connection address 625 the operating system sets the third connection pointer CP3 to that connection 620 which then becomes the new connection CP3. This connection is the next connection in the chain.

This loop of reading the fourth address then going to the connection of that address continues until the value at the place reserved for the fourth address is zero, that is, until the end of the chain is reached 628-Y.

When the end of the chain is reached, the address of connection CP is written to the fourth place in connection CP3 635, thereby making connection CP3 the next to last connection, and making the new connection the last connection. The value zero is written to the fourth place in connection CP. The connection creation process then ends.
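The connection creation process of FIGS. 5 and 6 might be sketched in Python as follows, by way of example only. The class names, the function name and the use of a single dictionary keyed by address are assumptions of this sketch; the addresses are taken from the illustrative figures, although the chains built here are shorter than those shown in FIGS. 10 and 11.

```python
from dataclasses import dataclass

NONE = 0  # zero marks "no address", as in the described embodiment

@dataclass
class DataValue:
    value: str
    associated: int = NONE  # address of the chain for this data value

@dataclass
class Connection:
    first: int           # first place: address of the first item in the pair
    second: int          # second place: address of the second item in the pair
    third: int = NONE    # third place: address of the associated chain
    fourth: int = NONE   # fourth place: next connection in the same chain

def create_connection(store, new_address, first, second):
    """Write a new connection and link it to the end of its parent's chain."""
    store[new_address] = Connection(first, second)
    parent = store[first]
    is_value = isinstance(parent, DataValue)
    head = parent.associated if is_value else parent.third
    if head == NONE:
        # No chain yet: the new connection becomes the start of the chain.
        if is_value:
            parent.associated = new_address
        else:
            parent.third = new_address
        return
    # Follow the fourth addresses until the zero which ends the chain.
    current = store[head]
    while current.fourth != NONE:
        current = store[current.fourth]
    current.fourth = new_address

store = {
    100650: DataValue("c"),
    100610: DataValue("o"),
    100586: DataValue("a"),
    100634: DataValue("m"),
}
create_connection(store, 219550, 100650, 100586)  # "ca" starts the chain for "c"
create_connection(store, 327645, 100650, 100610)  # "co" is appended to that chain
create_connection(store, 327651, 327645, 100634)  # "com" starts the chain for "co"
```

After these three calls the data value "c" holds 219550 as its associated address, connection 219550 holds 327645 as its fourth address, and connection 327645 holds zero in its fourth place and 327651 in its third place.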

Other mechanisms may be employed by a practitioner skilled in the art to instantiate this design of the structure of memory addresses and values; for example, by identifying the end member of the chain of connections by setting a flag in a part of the connection reserved for operating system-specific information other than addresses, or by holding the connection addresses in an index, rather than employing pointers to link one connection to the next. These variations are valid instantiations of the design and method of the present invention.

When a new connection is created in the index block, instances in the address sequence of the two relevant repeated values may be replaced with the address of the new connection, and the reorganisation process run again over the addresses being processed in the current reorganisation session. In this manner, groups in address sequences which are repeated and which consist of three or more addresses, may be established as structures of connections in the index.
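A minimal Python sketch of running the reorganisation process repeatedly is given below, by way of example only. It assumes that the most frequent pair is taken first, that new connection addresses are allocated sequentially from a hypothetical starting address of 1000, and that adjacent pairs are counted naively (overlapping pairs in a run of identical addresses are each counted); none of these choices is prescribed by the described embodiment.

```python
from collections import Counter

def reorganise(sequence, index, threshold, next_address):
    """Replace repeated pairs with new connection addresses until none qualifies."""
    while True:
        counts = Counter(zip(sequence, sequence[1:]))
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n <= threshold:
            break
        index[next_address] = pair  # record the new connection in the index
        out, i = [], 0
        while i < len(sequence):
            if tuple(sequence[i:i + 2]) == pair:
                out.append(next_address)  # connection address replaces the pair
                i += 2
            else:
                out.append(sequence[i])
                i += 1
        sequence = out
        next_address += 1
    return sequence

index = {}
# Hypothetical addresses 1, 2, 3 repeated as a group of three, four times.
result = reorganise([1, 2, 3] * 4, index, threshold=2, next_address=1000)
```

Here the repeated group of three addresses is first reduced to a pair containing a connection, and that pair is in turn reduced to a single connection, so that a structure of connections covering three addresses is established in the index.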

Referring now to FIG. 7 there is shown a flow diagram of the retrieving process.

From each stored address the retrieving process unpacks the connection structure, if any, which branches out below the connection at that address in the index. The operating system thereby builds up the values which were originally received, and which the address represents.

In retrieving information previously stored, firstly, the operating system reserves an area of memory in the processing block called the output string. The output string is used by the operating system as a place to put the data values which constitute the retrieved information, when it determines those data values. In some embodiments, an output string may not be implemented, the retrieved data values being passed directly to the client process or on to another process.

The retrieving process processes each address in each address sequence 705. The operating system reads the next address in the sequence 708, goes to that address 710 and retrieves the information which that address represents 715. FIG. 8 illustrates the process of unpacking such information. The operating system then tests for the end of the sequence 718 and when true 718-Y exits the retrieving process.

Referring now to FIG. 8, which is referred to in FIG. 7 item 715.

The operating system determines the type of the item at the current address 805. The type may be either data value or connection.

When it is a data value, the operating system writes the data value to the next available position in the output string 810.

When the address is the address of a connection 805-N, the operating system executes a loop 808 and reads down the left branches of the inverted tree connection structure (of which FIG. 12 is an example) which branches out below that address through various nodes (connections) to determine the data values on its lowest level.

When a data value is found and written to the output string, the operating system executes a conditional branch 815. When there are no higher levels (such as L0-L4 in FIG. 12) the unpacking process ends.

When a higher level exists, the operating system goes up one level 818. The operating system examines the connection on this higher level to determine whether the right hand branch has been read previously 820. When it hasn't, the operating system goes down the right hand branch and executes the loop starting in 805.

When the right hand branch has been previously read, the operating system checks to see if there is a higher level 822, and if there is, goes up to that level, then executes the loop starting in 820. When there is not a higher level, then the unpacking process ends.

The result of this process is that new instances of the original data values are written to the output string in the order in which they were originally received, thereby re-creating the original information.
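The unpacking process might be sketched in Python as follows, by way of example only. The sketch uses an explicit stack in place of the level-by-level bookkeeping of FIG. 8, which is an assumption of the sketch rather than the described method, although the left branch of each connection is still read before the right. The tables use the illustrative addresses of FIG. 12; the composition of connection 327651 ("com") is inferred from the description of FIG. 15, and the address 100674 for "y" follows the description of FIG. 13.

```python
# Data values block: address -> data value.
VALUES = {
    100650: "c", 100610: "o", 100634: "m", 100682: "p",
    100586: "a", 100666: "n", 100674: "y",
}

# Index block: connection address -> (first address, second address).
INDEX = {
    327645: (100650, 100610),  # "co"
    327651: (327645, 100634),  # "com"
    795228: (327651, 100682),  # "comp"
    321098: (100586, 100666),  # "an"
    678901: (321098, 100674),  # "any"
    890123: (795228, 678901),  # "company"
}

def unpack(address, values, index):
    """Return the data values which a stored address represents, in order."""
    output = []
    stack = [address]
    while stack:
        current = stack.pop()
        if current in values:
            output.append(values[current])  # a data value: write it out
        else:
            first, second = index[current]
            stack.append(second)  # right branch, read after the left branch
            stack.append(first)   # left branch, read next
    return "".join(output)

word = unpack(890123, VALUES, INDEX)
```

With these tables the single address 890123 unpacks to the word "company", and the intermediate address 795228 unpacks to "comp".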

This storage and retrieval process of the present invention allows efficient data storage in respect of certain types of data. For example, in a conventional storage system, when a user types "East 42nd Street" the letters are typically stored in a 1:1 ratio between typed items and stored items, whereas in respect of the present invention, this information might be stored in a 16:1 ratio as one memory address which represents three words (taking "42nd" to be a word) which add up to 16 characters.

Consider a local body database which records the addresses of buildings. There are, say, 1,000 buildings in East 42nd Street. Rather than storing the name East 42nd Street 1,000 times (consisting of a total of 16,000 characters), only 1,000 memory addresses need be stored.

Consider a national database of property addresses. The word "Street" may be stored, say, 20 million times, taking up 120 million characters. In respect of the present invention, only 20 million memory addresses need be stored. Even if a memory address is taken to occupy 4 bytes of storage and a character, one byte, the method of the present invention in this example represents a saving by compressing non-contiguous repetitions of groups of data values.
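The arithmetic of the two examples above may be checked with the following short Python sketch, taking a character to occupy one byte and a memory address four bytes as stated. The function name is an assumption of the sketch, and the storage occupied by the index itself is not counted.

```python
CHAR_BYTES = 1     # one byte per character, as in the example
ADDRESS_BYTES = 4  # four bytes per memory address, as in the example

def storage_bytes(occurrences, phrase):
    """Return (conventional bytes, bytes when one address stands for the phrase)."""
    conventional = occurrences * len(phrase) * CHAR_BYTES
    compressed = occurrences * ADDRESS_BYTES
    return conventional, compressed

local = storage_bytes(1_000, "East 42nd Street")
national = storage_bytes(20_000_000, "Street")
```

On these figures the local database stores 4,000 bytes of addresses in place of 16,000 characters, and the national database stores 80 million bytes of addresses in place of 120 million characters.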

In the case where data units are words and numbers, certain sub-processes are required to manage peculiarities which do not arise in respect of other types of data and data units. Spaces, capitalization, punctuation, and special characters may be processed by special sub-processes. For example, a space may be considered to be appended to a word; or to delimit a data value (word) rather than constitute a data value itself. A punctuation mark may be always represented as the address of its respective data value, and not form parts of connections; capitalization may be ignored, or alternatively, might be identified by a connection flag or series of connection flags which identify the letter or letters capitalised.

In the case where data units are characters, a data stream, of English text for example, would have no need for any particular type or variety of data unit to be predetermined as a delimiter. The connection structure would therefore be formed including spaces or any other data unit which might ordinarily be used to separate words, as mentioned above. Data units considered in this fashion would be of a generally uniform length, such as a seven bit ASCII character, possibly with or without a number of additional bits.

Referring now to FIG. 9 there is shown a specific illustrative example of the contents of part of a data values block. For example a character "a" 901 is stored in a location of a certain address 902 (namely address 100586), with which is associated an "associated address" 903 (namely 103765).

Referring now to FIG. 10 there is shown a specific illustrative embodiment of a chain of connections which starts at address 219550 1002 and which are associated with the data value "c". There is a first address 1001 in the first connection which is the address of the first item connected ("c"), a second address 1003 which is the address of the second item connected (respectively "a", "e", "o" . . . moving down the page), a third address 1004 which is the address of the associated chain which is on the next level up in the connection structure (arbitrary in this example), a fourth address 1005 which is the address of the next connection in the chain on the same level, and a place for other information used by the operating system 1006.

Referring back to the data value "c" at address 100650 in FIG. 9 and referring to the connection at address 327645 in FIG. 10 a particular connection establishing "co" can be seen.

Referring now to FIG. 11 there is shown a specific illustrative embodiment of the chain associated with the address that yields "co". The second address in each case indicates the various connections between "co" and other data values (respectively "a", "o", "n" . . . moving down the page). The second address could also normally indicate another connection.

Referring now to FIG. 12 there is shown a specific illustrative embodiment of a connection structure which yields the word "company" using addresses from FIGS. 9, 10, 11. Each location where a branching occurs is called a level L, and the bottom of the structure is called the bottom level L0. For example, the leftmost branches travel down indirectly through nodes, which are connections, and different and lower levels to the data value "c". There are five levels between (and including) the data value "c" and the address 890123, illustrated by the values L0-L4 on the left side of the figure. Whereas there are only two additional levels between the data value "y" and the address 890123, illustrated by the values L0-L2 on the right side of the figure.

890123 is the address of the top-level connection and the address of the structure. It is this single address which, in this example, is stored when the word "company" is received as a character data stream by the operating system. The numbers below this address illustrate a connection structure which yields the word "company", and would typically be set up as a result of reorganisation following a number of receiving processes according to FIGS. 1 to 6.

In order to later reproduce from the structure what was originally received, the operating system from the single top-level address 890123 follows the lower level connections down and across the structure to yield the word "company", in the manner illustrated in FIG. 7 and FIG. 8.

Connection structures of different levels and branchings might also, in a different embodiment or in the same embodiment at a different time, point to the word "company". Alternatively, there might not be a single top-level address which yields the word "company". The word "company" might, for example, be stored as two addresses which yield, through their two respective connection structures, the strings "comp" and "any". Or the word "company" could, in a poorly organised or young system, be stored as the addresses of its data values: 100650, 100610, 100634, 100682, 100586, 100666 and 100647 or some crude abstraction of them, such as 327645, 100634, 100682, 321098 and 100647.

Referring now to FIG. 13 there is shown a specific illustrative embodiment of the contents of a processing array during receipt of the word "company", given that the connections shown in FIG. 12 are already in existence. The addresses in the array locations relate to the addresses in FIG. 9, FIG. 10, FIG. 11, FIG. 12 and FIG. 14. The process which operates in respect of this array is illustrated in FIG. 1, FIG. 2 and FIG. 3.

The letter or data unit "c" is first received and identified in relation to address 100650, which is written in location 1 of the processing array. The letter "o" is then received and its address in the values block is written to location 2. A connection is then found to exist at address 327645 and replaces both, in location 1. The letters "m" and "p" are then received, and again existing connections are identified, resulting in the address 795228 being stored as a compression of "comp".

The letter "a" is then received, but no connection between 795228 and 100586 exists. The letter "n" is then received and a connection between 100586 and 100666 is identified, and stored as address 321098, representing "an". Similarly a connection between 321098 and 100674 is identified as 678901 on receiving the letter "y". Finally an existing connection between 795228 and 678901 ("comp" and "any") is identified and stored as 890123.
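The receipt of the word "company" described above might be sketched in Python as follows, by way of example only. The function name and the representation of the index as a dictionary keyed by ordered pairs are assumptions of the sketch. The addresses are those given above; the composition of connection 327651 ("com") is inferred from the description of FIG. 15.

```python
# Values block lookup: data unit -> address.
VALUE_ADDRESSES = {
    "c": 100650, "o": 100610, "m": 100634, "p": 100682,
    "a": 100586, "n": 100666, "y": 100674,
}

# Index lookup: (first address, second address) -> connection address.
PAIR_INDEX = {
    (100650, 100610): 327645,  # "co"
    (327645, 100634): 327651,  # "com"
    (327651, 100682): 795228,  # "comp"
    (100586, 100666): 321098,  # "an"
    (321098, 100674): 678901,  # "any"
    (795228, 678901): 890123,  # "company"
}

def receive(data_units, value_addresses, pair_index):
    """Compress a stream of data units against connections already in the index."""
    array = []  # the processing array
    for unit in data_units:
        array.append(value_addresses[unit])
        # Replace the last two entries while they form an existing connection.
        while len(array) >= 2 and (array[-2], array[-1]) in pair_index:
            second = array.pop()
            first = array.pop()
            array.append(pair_index[(first, second)])
    return array

stored = receive("company", VALUE_ADDRESSES, PAIR_INDEX)
```

With these tables the seven received letters are reduced to the single address 890123. A word for which no complete structure exists, such as "can", is left as the address of "c" followed by the address of the connection "an".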

Referring now to FIG. 14 there is shown a specific illustrative comparison between what is received, what is stored, and what is retrieved. This simply shows that an improvement of 7:1 has been achieved in storing the word "company" at a single memory address.

Referring now to FIG. 15 there is shown a specific illustrative embodiment of a connection structure which points to the word "common" which illustrates an efficiency which may be achieved in information structures of this design, namely that the same connection substructure 327651 which exists in FIG. 12 in respect of the word "company" also exists in FIG. 15 in respect of the word "common". Here connection 327651 ("com") is connected to 100634 ("m") rather than 100682 ("p"). Connections 932655 and 795228 could form part of a chain.

Referring now to FIG. 16 there is shown a specific illustrative embodiment of a connection structure which yields the street name "East 42nd Street". This indicates how whole phrases or even sentences may be compressed as a result of reorganisation processing, in respect of a data unit type being a complete word, providing they occur sufficiently frequently.

Referring to FIG. 17 there is shown by way of example only, for completeness of the description, a desktop computer station in which a system incorporating software according to the present invention could be implemented. It will be understood that the system could also be implemented in a wide range of computer or communications equipment or other equipment for the purpose of data storage and compression.

Although the specific connections are not shown, the work station comprises a keyboard 10 for user input, which would normally be connected to a processor/disc drive box 11, and in turn to a video display unit 12. Other items of equipment such as a data scanner, modem or printer may or may not also be present. The station might also be connected as part of a network and server system. Data entered through the keyboard or downloaded from an external source could be compressed and stored at the station according to the invention.

Referring to FIG. 18 there is shown again by way of example, which will be fully appreciated by the skilled person, a generalised software system which may be implemented on the computer station of FIG. 17. The work station is controlled by operating system software 20 which functions in conjunction with a number of application programs 23 which may be chosen by a user. Data compression according to the present invention may be implemented as part of the operating system 20 or as a separate application program 23. Data may be input from a variety of sources such as the keyboard or a scanner, through a data input interface 21. Compressed data may be output to an external storage medium such as a disc drive, or transmitted to a remote site, through a data output interface 22.

The particular method of implementing the present invention may vary depending on a number of factors including the particular computer, type(s) of data, programming language, and the intended use of the invention. In adapting the teachings of the present invention to different applications, those of ordinary skill in the art will modify the preferred embodiment described herein. Accordingly, the invention should not be limited by the foregoing description of the preferred embodiment, but rather should be interpreted in accordance with the following claims.

I claim:
 1. A method of compressing data by computer, in which connection groups of two or more consecutive data units are identified and recorded as compression proceeds, comprising: receiving a stream of data units to be compressed, with no predetermined variety of data unit having significance as a delimiter of other data units, storing a record of each received data unit in a processing block of computer memory, determining whether each received data unit forms an already identified connection group existing in an index block of computer memory, when considered in relation to any immediately preceding data unit or connection group recorded in said processing block, storing in said processing block a fresh record of any such already identified connection group formed by said received data unit, in place of said received data unit and said immediately preceding data unit or connection group recorded in said processing block, determining whether each connection group freshly recorded in said processing block forms a larger already identified connection group existing in said index block, when considered in relation to any respective immediately preceding data unit or connection group recorded in said processing block, storing a further fresh record of any such larger already identified connection group, in place of said connection group freshly recorded in said processing block and said respective immediately preceding data unit or connection group recorded in said processing block, delivering the records stored in said processing block for further storage or for transmission as required, once said stream of data units has been received and compressed.
 2. A method according to claim 1, in which at least one data stream has been received and stored, further comprising: determining a number of occurrences of at least one pair of consecutive records in said received data, determining whether the number of occurrences of said at least one pair of consecutive records exceeds a threshold number, storing in said index block a record of a new connection group for any such pair of consecutive records in said received data having a number of occurrences greater than said threshold number, storing in said received data a record of said new connection group in place of each occurrence of said consecutive pair of records which said new connection group represents.
 3. A method according to claim 2, comprising, repeatedly, storing records of connection groups representing consecutive pairs of records having at least the threshold number of occurrences, until no further such pairs can be found.
 4. A method of compressing data according to claim 1, further comprising: determining whether each received data unit exists in a values block of already identified data units or is a new data unit, storing a record of each new data unit in said values block.
 5. A method according to claim 1 wherein the data units are of a substantially uniform length.
 6. A method according to claim 5 wherein the data units are characters or image pixels.
 7. A method of compressing data by computer, comprising, repeatedly, receiving a new data stream and storing an initially compressed data stream wherein connection groups of two or more consecutive data units are identified and recorded as compression proceeds by performing the steps of: receiving a stream of data units to be compressed, with no predetermined variety of data unit having significance as a delimiter of other data units, storing a record of each received data unit in a processing block of computer memory, determining whether each received data unit forms an already identified connection group existing in an index block of computer memory, when considered in relation to any preceding data unit or connection group recorded in said processing block, storing in said processing block a fresh record of any such already identified connection group formed by said received data unit, in place of said received data unit and said immediately preceding data unit or connection group recorded in said processing block, determining whether each connection group freshly recorded in said processing block forms a larger already identified connection group existing in said index block, when considered in relation to any respective immediately preceding data unit or connection group recorded in said processing block, storing a further fresh record of any such larger already identified connection group, in place of said connection group freshly recorded in said processing block and said respective immediately preceding data unit or connection group recorded in said processing block, delivering the records stored in said processing block for further storage or for transmission as required, once said stream of data units has been received and compressed, and reorganizing said initially compressed data stream by creating and storing new connection groups wherein at least one data stream has been received and stored by performing the steps of: determining a number of occurrences of at least one pair of consecutive records in said received data, determining whether the number of occurrences of said at least one pair of consecutive records exceeds a threshold number, storing in said index block a record of a new connection group for any such pair of consecutive records in said received data having a number of occurrences greater than said threshold number, storing in said received data a record of said new connection group in place of each occurrence of said consecutive pair of records which said new connection group represents.
 8. A structure in computer memory, comprising a plurality of records of data units and a plurality of records of connections, wherein: each record of a data unit represents the respective data unit, and the presence or absence of a chain of records of connections in which the data unit is represented, and each record of a connection represents a respective connection relationship between two data units, two connection relationships, and data unit and a connection relationship, or a connection relationship and a data unit, and the presence or absence of a chain of records of connections in which the respective connection relationship is in turn represented as part of further relationships.
 9. A structure according to claim 8, wherein each record of a connection has primary and secondary components which represent the two data units, the two connection relationships, the data unit and connection relationship, or the connection relationship and data unit.
 10. A structure according to claim 9, wherein each record of a connection has a further component which represents the presence or absence of another record of a connection having a primary component in common with said each record of a connection.
 11. A structure according to claim 9, wherein each record of a connection has a further component which represents the presence or absence of another record of a connection having a primary component which represents said each record of a connection.
 12. A structure according to claim 9, wherein each said chain of records of connections in which the data unit is represented contains every such record of a connection in which the data unit is represented as the primary component.
 13. A structure according to claim 9, wherein each said chain of records of connections in which the respective connection relationship is represented contains every such record of a connection in which the connection relationship is represented as the primary component.
 14. A structure according to claim 9, wherein the primary and secondary components of each record of a connection are related by their consecutive occurrence in a data stream.
 15. Computer apparatus containing program instructions which create a data structure according to claim 8.
 16. A method of compressing a data stream, in which groups of data units are replaced by connection representations from an existing data structure, including the steps of: (a) storing a newly received data unit in a processing array which has already received at least one data unit, (b) searching the data structure for a connection representation to replace the newly received data unit and a preceding data unit or preceding connection representation in the array, (c) storing the connection representation in the array if located in (b) as a replacement for the newly received data unit and the preceding data unit or connection representation as the case may be, (d) searching the data structure for a further connection representation to replace the newly stored connection representation in (c) and a respective preceding data unit or preceding connection representation in the array, (e) storing the further connection representation in the array if located in (d), as a replacement for said newly stored connection representation and said respective preceding data unit or preceding connection representation in the array, (f) repeating steps (d) to (e) until no further connection representation is located, and (g) repeating steps (a) to (f) until no further data unit is received.
 17. A method according to claim 16 further including the steps of: (h) scanning the data stream which has been compressed according to steps (a) to (g) to determine the number of occurrences of each consecutive data unit pair, connection representation pair, data unit and connection representation pair, or connection representation and data unit pair, (i) storing a new connection representation in the data stream to replace each pair determined in (h) for which the number of occurrences is greater than a threshold number, and (j) storing each new connection representation in the data structure.
 18. A method according to claim 16 wherein no predetermined variety of data unit has significance as a delimiter of other data units.
 19. Computer apparatus containing program instructions which implement a method according to claim 16.